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Abstract 

Background: High-throughput sequencing of cDNA libraries (RNA-Seq) has proven to be a highly effective 
approach for studying bacterial transcriptomes. A central challenge in designing RNA-Seq-based experiments is 
estimating a priori the number of reads per sample needed to detect and quantify thousands of individual 
transcripts with a large dynamic range of abundance. 

Results: We have conducted a systematic examination of how changes in the number of RNA-Seq reads per 
sample influences both profiling of a single bacterial transcriptome and the comparison of gene expression among 
samples. Our findings suggest that the number of reads typically produced in a single lane of the lllumina HiSeq 
sequencer far exceeds the number needed to saturate the annotated transcriptomes of diverse bacteria growing in 
monoculture. Moreover, as sequencing depth increases, so too does the detection of cDNAs that likely correspond 
to spurious transcripts or genomic DNA contamination. Finally, even when dozens of barcoded individual cDNA 
libraries are sequenced in a single lane, the vast majority of transcripts in each sample can be detected and 
numerous genes differentially expressed between samples can be identified. 

Conclusions: Our analysis provides a guide for the many researchers seeking to determine the appropriate 
sequencing depth for RNA-Seq-based studies of diverse bacterial species. 



Background 

In recent years, high throughput sequencing of cDNA 
libraries (RNA-Seq) has emerged as a powerful technology 
for profiling gene expression, discovering previously unan- 
notated genes, and mapping transcriptome architecture in 
a wide variety of bacterial species [1-11]. RNA-Seq offers 
several advantages over hybridization-based approaches 
such as microarrays, including a markedly higher sensitiv- 
ity for low abundance transcripts, single nucleotide reso- 
lution of transcript boundaries, and the means to profile 
gene expression in strains for which genome sequences 
and/or gene annotations are not available [12,13]. The 
steadily decreasing cost of sequencing, the growing num- 
ber of and accessibility to high-throughput sequencing 
facilities, and the recent development of publicly available 
bioinformatic tools for RNA-Seq data analysis have made 
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RNA-Seq an increasingly attractive and popular method 
for studying bacterial transcriptomes. 

The relative abundances of individual transcripts in a 
bacterial transcriptome can differ by several orders of 
magnitude. In order to generate comprehensive tran- 
scriptome profiles using RNA-Seq one must therefore 
obtain a sufficiently large number of reads to detect 
those biologically relevant transcripts that comprise a 
relatively small proportion of the cDNA library. Detec- 
tion and quantification of low abundance transcripts by 
RNA-Seq can be enhanced in two main ways. First, the 
total number of reads per library can be increased. Second, 
the proportion of reads representing rare transcripts can be 
increased by depleting abundant transcripts from total 
RNA and/or depleting cDNAs representing these abundant 
transcripts from cDNA libraries. This is often achieved by 
targeted removal of ribosomal RNA (rRNA), which com- 
prises 80-95% of bacterial transcriptomes, from total RNA 
prior to cDNA library construction [14,15]. 

For many RNA-Seq-based projects, the budget for 
sequencing costs, and thus the total number of reads 
that can be obtained, is constrained. Thus, researchers 
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designing RNA-Seq experiments must often determine 
the correct balance between sequencing depth (the num- 
ber of reads per sample) and breadth (the number of 
samples sequenced). For some applications of RNA-Seq 
such as transcriptome mapping and annotation, the abi- 
lity to detect rare transcripts is critical, and approaches 
such as the ones described above for increasing the total 
number of biologically relevant reads obtained per sample 
play a central role. For other applications of RNA-Seq 
breadth can often be more important than depth. Specifi- 
cally, for experiments focused on comparing gene expres- 
sion among various strains and/or growth conditions, the 
inclusion of more strains, timepoints, biological replicates, 
and/or growth conditions may be worth the tradeoff of 
lower depth per sample, as it may provide additional bio- 
logical insights and/or statistical confidence that is more 
valuable than the ability to detect low abundance tran- 
scripts in each sample. 

In recent years, methods for incorporating barcoded 
adapters into cDNA libraries have been developed that 
allow reads derived from up to several dozen samples to 
be sequenced in the same lane [16]. This approach, known 
as multiplexing, enables researchers to flexibly vary the 
number of samples sequenced per lane and thus obtain 
the desired balance between the number of samples 
included and the number of reads obtained per sample, in 
particular when number of lanes of sequencing is budget 
limited. However, the extent to which biologically relevant 
information is gained or lost as sequencing depth is varied 
has not been systematically examined. To address this we 
have generated and analyzed a variety of RNA-Seq data- 
sets to determine the number of reads needed to saturate 
the transcriptome of E. coli and examined how reducing 
sequencing depth affects the ability to detect and quantify 
transcripts both within and between samples in diverse 
bacterial species. 

Results 

Ultra-deep sequencing of the £ coli transcriptome 

Previous studies have suggested that accurate quantification 
of > 95% of transcripts in a mammalian cell line (including 
splice junction level quantification) requires -700 million 
reads [17]; however, no estimate of the number of reads 
needed to approach saturation of a bacterial transcriptome 
has been reported. To address this question, we isolated 
total RNA from a log phase culture of Escherichia coli K-12 
which was then depleted of rRNA using the RiboZero kit 
(Epicentere), converted to a strand-specific Illumina cDNA 
library as described [14], and sequenced in one lane of 
Illumina HiSeq. This produced a dataset of more than 306 
million total reads aligning to the E. coli K-12 genome. 
Over 97% of these reads corresponded to properly mapped 
paired end reads, i.e. those corresponding to reads derived 
from opposite ends of the same cDNA mapping no more 



than 450 base pairs apart on the genome (the approximate 
maximum size of cDNAs in the library - see Methods). 
Properly mapped paired end reads were resolved into a 
single fragment by filling in the gap between them (if any). 
For pairs of reads that was not properly mapped, one read 
was discarded and the remaining reads along with unpaired 
reads were each treated as independent fragments. In total 
this dataset contained approximately 156 million aligned 
fragments with an average length of 159 nucleotides. rRNA 
depletion in this sample was nearly complete, with less 
than 0.15% of fragments aligning to rRNA-encoding genes 
(Additional file 1: Table SI). 

The proportion of annotated ORFs represented in 
this dataset was very high, with all but 2 of 4149 ORFs 
annotated in RefSeq covered by at least 1 fragment 
(Additional file 2: Table S2). Coverage of the genome 
also approached saturation, with at least 1 fragment map- 
ping to over 94% of strand-specific genomic positions. 
Importantly, the density of this coverage varied markedly 
among different regions of the genome (Figure 1A). For 
example, while 96% of bases within annotated ORFs were 
detected by 10 or more fragments, only 60% of bases in 
regions antisense to annotated ORFs were detected above 
this cutoff. Similarly, the density of coverage was relatively 
high for genes encoding non-coding RNAs (ncRNAs) and 
relatively low in intergenic regions (Figure 1A). 

As shown in Figure 1A, a surprisingly high proportion 
of antisense and intergenic positions were covered by at 
least 1 fragment. We reasoned that this could be due to 
limitations in the method used to maintain strand speci- 
ficity in our libraries [18,19]. In this method, dUTPs are 
incorporated only into the second strand of cDNAs du- 
ring cDNA synthesis and these dUTPs are then excised 
prior to library amplification, ensuring that only the first 
cDNA strand is efficiently amplified. Incomplete incor- 
poration and/or excision of dUTPs would presumably 
lead to low levels of antisense fragments corresponding 
to the second strand of cDNAs. To assess the level of 
second strand contamination in our samples, we compared 
the average fragment coverage on the sense and antisense 
strands of each annotated ORF with the expectation that 
this coverage should be somewhat correlated if second 
strand removal was incomplete. As shown in Additional 
file 3: Figure SI, there was very little positive correlation 
(R 2 = 0.0004) between the fragment coverage of sense 
and antisense strands, even among highly expressed 
genes. In contrast, the correlation in the coverage of ORF 
sense and antisense strands was much higher (R 2 =0.83) 
when a similar rRNA-depleted E. coli cDNA library was 
not subjected to dUTP excision prior to amplification 
and sequencing. Thus, incomplete strand specificity in 
our libraries does not seem to have contributed sig- 
nificantly to the observed high coverage of antisense 
positions. 
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Figure 1 Coverage of the E. coli K-12 genome by ultra-deep RNA-Seq data. Annotation of genomic positions as antisense ORF, non-coding 
RNA (ncRNAs) intergenic (IGR), and antisense to ORFs or ncRNAs (AS) was based on gene annotations in the RefSeq and Rfam databases. 
Positions on the opposite strand of genes were annotated as antisense only if no other genes were annotated at those positions. A) without 
gDNA subtraction. B) with 0.5% gDNA subtraction. C) with 1% gDNA subtraction. 



Another explanation for the high coverage of antisense 
and intergenic positions observed is that a much higher 
proportion E. coli genome is transcribed than is suggested 
by current gene annotations. Indeed, several recent studies 
have demonstrated widespread transcription from the anti- 
sense strand of protein-encoding genes in diverse bacteria 
[20-23]. While in some cases these antisense transcripts 
have been shown to play important regulatory functions, 
two recent studies in Bacillus subtilis and E. coli K-12 
suggest that many antisense RNAs derive from spurious 
transcription initiation or incomplete transcription ter- 
mination and may not be functionally relevant [24,25]. 
Thus many of the fragments aligning to intergenic regions 
of the genome may correspond to non-specific trans- 
cription initiation or leaky transcription termination of up- 
stream genes. Other sequences from intergenic regions 
may be derived from previously unannotated ncRNAs. 
Recent studies suggest the prevalence of ncRNA genes has 
likely been underestimated, even in well-studied bacteria 
such as E. coli K-12 [7,26]. 

Finally, the nearly complete RNA-Seq read coverage of 
the genome could also reflect contamination of our cDNA 
libraries with a low amount of E. coli genomic DNA 
(gDNA). While total RNA was subjected to 2 rounds of 
DNase treatment and no gDNA was detected following 40 
rounds of PCR prior to cDNA synthesis, it is possible that 
removal of gDNA from our total RNA was not complete. 
Similarly, reagents used after DNase treatment in library 
construction may also have introduced low amounts of 
E. coli gDNA contamination. 

Taken together our findings suggest that a sequencing 
depth of 156 million fragments is sufficient to saturate 



the E. coli K-12 transcriptome but also yields numerous 
fragments aligning to very rare and potentially non- 
functional transcripts and/or to low-level contaminants 
introduced during library construction. 

Genome coverage of RNA-Seq data after background 
subtraction 

While read coverage of annotated E. coli genes was 
nearly complete in the 156M read dataset, the possibility 
of gDNA contamination raised concern that some of 
these genes were not actually transcribed. To better esti- 
mate the proportion of E. coli genes transcribed under 
the conditions tested, we devised an algorithm to sub- 
tract potential gDNA background from our RNA-Seq 
dataset based on the assumption that, unlike reads 
corresponding to cDNAs, the alignment of reads cor- 
responding to gDNA would be uniformly distributed 
across the E. coli genome. As shown in Figure IB and 
1C, background subtraction assuming 0.5% or 1% gDNA 
contamination led to relatively modest decreases in ORF 
and ncRNA coverage but to significant drops in coverage 
of IGR and AS positions. Indeed, after applying a 1% 
background subtraction, only 33% and 62% of AS and 
IGR positions were covered at saturation, respectively, 
compared to 90% and 92% of ORF and ncRNA position, 
respectively. While the actual extent of gDNA contami- 
nation is difficult to ascertain, the results of our PCR 
screen prior to cDNA synthesis suggest it is unlikely to 
be as high as 1%. Yet even with this high level of sub- 
traction, at least 1 and 10 reads aligned to 98% and 95% 
of annotated ORFs, respectively, suggesting that a very 
high proportion of annotated E. coli genes are expressed 
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at least at low levels during exponential growth in rich 
media. 

Of the 100 ORFs to which no reads aligned following 
1% subtraction, several are near the minimum size cutoff 
of cDNAs efficiently maintained during library construc- 
tion. These include 4 of the 5 ibs toxic membrane pro- 
teins that may indeed not be expressed under normal 
growth conditions [27] (Additional file 2: Table S2). 
Importantly, ORFs annotated as "predicted proteins" or 
encoded within annotated prophages were enriched 
more than 2- and 4-fold, respectively, among the un- 
detected ORFs. Moreover, many undetected ORFs were 
clustered in known operons, including 5 of 7 ORFs in 
the rut operon involved in pyrimidine degradation [28], 
5 of 6 ORFs in the cit operon encoding components of 



an inactive citrate lyase [29], and 8 of 15 ORFs in the 
phn operon required for use of phosphonate and phos- 
phite as phosphorous sources [30] (Additional file 2: 
Table S2). Some of the 100 ORFs not represented in our 
RNA-Seq data have been shown to be expressed in other 
studies conducted under different growth conditions, 
suggesting the transcription of these genes is highly 
repressed and/or the half-lives of these transcripts is very 
short during exponential growth of E. coli K-12 in LB 
medium. 

Effect of ribosomal RNA depletion on RNA-Seq 
transcriptome profiles 

We next assessed to what extent rRNA depletion 
increases detection of low expressed transcripts by 
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RNA-Seq. To this end, we constructed another Illumina 
library derived from the same total RNA used to pro- 
duce the initial rRN A- depleted dataset and sequenced 
this library in a single Illumina HiSeq lane. While the 
depleted and undepleted libraries yielded a similar num- 
ber of total fragments, 82% number of fragments in the 
undepleted sample aligned to rRNAs and the number of 
fragments aligning to ORFs in this sample was more 
than 8-fold lower than in the depleted sample. 

As expected, the proportion of annotated ORFs detected 
was higher in the depleted than the undepleted samples 
(Figure 2A). However, even in the undepleted sample, at 
least one fragment mapped to over 99% of annotated ORFs, 
and over 96% of annotated ORFs were associated with 20 
or more fragments. Moreover, the subset of ORFs detected 
with a minimum of 10 fragments per ORF was only 2% 
lower in the undepleted sample. Thus, in a dataset contain- 
ing enough fragments to saturate the E. coli transcriptome, 
the lack of rRNA depletion greatly reduced the number of 
mRNA-derived fragments obtained but led to only a rela- 
tively modest decrease in the proportion of annotated E. 
coli ORFs detected. 

Effect of decreased sequencing depth on RNA-Seq 
transcriptome coverage in E coli 

To systematically assess how decreasing fragment counts 
per sample affects the comprehensiveness of gene expres- 
sion profiles, we developed scripts that randomly sampled 
our 156 million fragment rRNA-depleted E. coli RNA-Seq 
dataset to create datasets with decreasing numbers of 
fragments. The ORF and genome coverage provided by 
these datasets was then quantified and compared. To 
ensure our sampling approach accurately simulated mul- 
tiplexing, we re-sequenced the E. coli cDNA library, this 
time multiplexed with 11 unrelated libraries in the same 
HiSeq lane, producing a dataset with approximately 15 
million total fragments. Importantly, both the levels of 
genome coverage and the number of fragments per ORF 
in this dataset correlated very well (R 2 > 0.99) with those 
of a dataset of 15 million fragments sampled from the 
156 million fragment dataset. 

As shown in Figure 2B, reducing the number of frag- 
ments led to a decrease in the proportion of annotated 
ORFs to which 1 or more fragments aligned. However, 
this decrease was often relatively small compared to the 
reduction in the number of fragments. For example, 
decreasing the number of fragments over 15-fold from 
156 to 10 million fragments led to only a 3% and 7% loss 
in the number of ORFs detected with more than 5 and 
10 fragments, respectively. Indeed, even with only 2 mil- 
lion fragments, 96% and 84% of ORFs were covered by 
at least 1 fragment and 5 fragments, respectively. 

As shown in Figure 3A, positions within annotated 
genes were nearly saturated by 50 million fragments, 



and only relatively incremental increases in annotated 
gene coverage were obtained above 10 million frag- 
ments. A similar trend was observed in intergenic posi- 
tions. As the number of fragment continues to increase 
beyond 50 million, nearly all new positions detected 
were within antisense regions of the genome, many of 
which, as discussed above, may correspond to non- 
functional spurious transcripts or gDNA contamination. 
Indeed, in the background subtracted datasets, very few 
new positions were detected in any category in datasets 
with more than 50 million fragments (Figure 3B and 
3C). Taken together, these findings suggest that 50 mil- 
lion non-rRNA fragments yield nearly complete cover- 
age of biologically relevant E. coli transcripts expressed 
during log phase growth in LB. Moreover, they suggest 
that vast majority of the E. coli transcriptome can be 
detected under this growth condition even with datasets 
of only 5-10 million non-rRNA fragments. 

Effect of decreased sequencing depth on RNA-Seq 
transcriptome coverage in M. tuberculosis and V. cholerae 

The regulatory networks governing gene expression can 
diverge significantly among different bacteria. Moreover, 
patterns of gene expression can vary dramatically among 
different growth conditions. To assess whether the rela- 
tionship between sequencing depth and transcriptome 
coverage described above extends beyond log-phase E. 
coli K-12 cultures growing in LB, we repeated the analysis 
above with RNA-Seq data derived from log phase LB cul- 
tures of Mycobacterium tuberculosis (Figure 4), a species 
whose GC content, gene content and organization, and 
physiology are significantly diverged from those of E. coli. 
Importantly, similar levels of coverage of annotated ORFs 
and ncRNAs were seen in these M. tuberculosis data- 
sets containing 5 and 10 million non-rRNA fragments 
(Figure 4). We also analyzed RNA-Seq datasets containing 
5 and 10 million non-ribosomal fragments derived from 
log phase cultures of Vibrio cholerae growing in M9 min- 
imal medium [3] and found similar levels of gene cover- 
age, though coverage of antisense and intergenic regions 
in these data was somewhat lower (Figure 4). These results 
suggest that a sequencing depth of 5-10 million non- 
rRNA fragments enables profiling of the vast majority of 
transcriptional activity in diverse species grown under di- 
verse culture conditions. 

Using RNA-Seq to identify differentially expressed genes: 
how important is depth? 

In addition to its utility in profiling the transcriptome of a 
single strain of interest, RNA-Seq is also a powerful tool for 
comparing gene expression among different strains and/or 
growth conditions. A recent study by Tarazona et al exa- 
mined the relationship between sequencing depth and the 
reliable identification of changes in gene expression in 
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human RNA-Seq data [31] but to date no similar analysis 
has been conducted for bacterial RNA-Seq data. To assess 
how changes in sequencing depth influence RNA-Seq- 
based analysis of differential gene expression in bacteria, we 
sequenced rRNA-depleted total RNA isolated from LB 
cultures of E. coli 0157:H7 strain EDL933 (from hereon 
referred to as EDL933) at the late exponential and early 



stationary phases. cDNA libraries corresponding to 2 bio- 
logical replicates for each time point were subjected to mul- 
tiplexed sequencing using Alumina HiSeq to yield 25-30 
million fragments per sample. Data between biological 
replicates for each time point was were extremely well 
correlated (R 2 of fragments/ORF = 0.99). To examine the 
impact of having fewer fragments on the results of 
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Figure 4 Coverage of various bacterial genomes by RNA-Seq 
data with varying sequencing depth. Annotations of genomic 
positions was conducted as described in the Figure 1 legend. 5M and 
10M denote RNA-Seq databases with 5 and 10 million non-rRNA 
fragments, respectively. Ec, Mt, and Vc correspond to E coli K-12, 
Mycobacterium tuberculosis, and Vibrio choleroe, respectively. 

differential expression analysis, we scaled down the counts 
of fragments per gene from each dataset while retaining the 
original values of relative gene expression. 

We first used these sampled datasets to determine how 
changes in sequencing depth influenced the detection of 
transcripts in two independent biological replicates. As 
shown in Figure 5, the total number of transcripts 
detected in both replicates rose significantly as depth was 
increased, particularly among lowly expressed genes. Im- 
portantly, these increases began reaching an asymptote 
around 13 million fragments, suggesting that additional 
depth beyond this point did relatively little to increase 
either the number or percent of all genes detected in both 
biological replicates. 

We next analyzed the full and sampled datasets with 
DESeq, a variance-analysis package that uses a model 
based on the negative binomial distribution to infer 
statistically significant differences in gene-expression 
from RNA-Seq data [32]. Based on the counts of gene- 
mapped fragments derived from the full RNA-Seq data 
set of -25 million fragments per sample, DESeq identi- 
fied 2486 genes (corresponding to 45% of all annotated 
EDL933 genes) as being at least 2-fold up- or down- 
regulated (P < lxlO" 3 ). As shown in Figure 6A, reduc- 
tions in sequencing depth correlated with a decrease in 
the number of genes identified as differentially expressed 
below this P-value cutoff. As expected, the effect of 
decreased depth was most marked for genes whose 
differential abundance between the two growth phases 
was relatively small (Figure 6A). For example, a 10-fold 
decrease in depth resulted in a loss of 38% of genes 
2-5-fold differentially expressed but only 9% of genes 
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Figure 5 Effect of sequencing depth on the detection of 
transcripts in two independent biological replicates. The 

percent of all annotated ORFs detected by RNA-Seq above the 
indicated RPKM thresholds in both replicates of EDL933 exponential 
phase cultures. 
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whose differential expression was greater than 10-fold. 
However, even when the depth was reduced to 2.5-3 
million fragments in each dataset, 1704 genes were iden- 
tified as differentially regulated more than 2-fold with 
P < lxlO" 3 . Our findings indicate that when data from 
well-correlated biological replicates are included, 2-3 
million fragments per sample enable a significant num- 
ber of genes differentially expressed by 2-fold or more to 
be identified with high statistical significance. 

The ability to reliably identify differentially expressed 
genes by RNA-Seq is affected by a variety of factors 
aside from total sequencing depth that can vary signifi- 
cantly from one experiment to another, including the 
number of biological replicates included and the vari- 
ation between them, the average abundance of differen- 
tially expressed genes, and the magnitude of their 
differential expression under the conditions tested. We 
therefore repeated the analysis above with RNA-Seq data 
that were distinct in several ways from the EDL933 data. 
Specifically, these data were derived from V. cholerae 
growing in M9 minimal medium or isolated from the 
cecal fluid of 2 orally infected infant rabbits [3]. More- 
over, the correlation between the 2 rabbit samples was 
much lower than for the EDL933 in vitro samples 
(R 2 =0.69). Finally, the total number of non-rRNA frag- 
ments for these datasets was between 4 and 6 million, 
significantly less than in the EDL933 datasets. 

Despite these numerous differences, the impact of 
reducing the number of fragments in the V. cholerae 
and EDL933 datasets on the ability to detect differen- 
tially expressed genes was very similar (Figure 6B). Im- 
portantly, as we observed in the analysis of the EDL933 
data, numerous genes were identified as differentially 
expressed by at least 2-fold (P < lxlO" 3 ) even with a 
reduction of fragments to 2-3 million per sample. These 
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Figure 6 Effect of decreased sequencing depth on detection of differentially expressed genes by RNA-Seq. Differentially expressed genes 
were identified by DESeq with P < 10" 3 . A) Comparison of EDL933 gene expression in exponential and stationary phase. The total number of 
aligned non-rRNA fragments in these datasets ranged from 25-30 million. B) Comparison of V. choleroe gene expression in minimal media and 
the rabbit cecum. The total number of aligned non-rRNA fragments in these datasets ranged from 4-6 million. 



included all 16 of the major V. cholerae colonization or 
virulence factors identified as induced in the rabbit when 
the full datasets were compared [3]. Strikingly, the differ- 
ential expression of all but one of these major colonization 
and virulence factors was detected (P < lxlO" 3 ) even when 
the total number of non-rRNA fragments was reduced 
100-fold to 40,000-60,000 total fragments. While it is not 
possible to accurately simulate how changes in depth will 
affect RNA-Seq comparative gene expression analyses in 
all cases, our findings indicate that in diverse species and 
growth conditions and even with relatively low correlation 
between biological replicates, 2-3 million fragments per 
sample enable a significant number of genes differentially 
expressed by 2-fold or more to be identified with high 
statistical significance. 

Discussion 

We have conducted a systematic analysis of how 
changes in sequencing depth affect analysis of bacterial 
RNA-Seq data, both for profiling gene expression in a 
single sample and for comparing gene expression among 
different strains and/or growth conditions. Our findings 
suggest that 5-10 million non-rRNA fragments are 



sufficient to detect all but a few of the most low 
expressed genes in diverse bacteria growing under a 
variety of conditions. Moreover, we found that when the 
number of non-rRNA fragments in E. coli exceeds 50 
million, detection of biologically relevant transcripts all 
but ceases and much of the additional coverage gained 
appears to represent very rare transcriptional events 
and/or gDNA contamination. We also found that when 
RNA-Seq data from biological replicates is available, dif- 
ferential expression of numerous genes can be detected 
with high statistical significance even when the number 
of fragments per sample is reduced to 2-3 million. 

The optimal sequencing depth for an RNA-Seq based 
study will vary considerably based on the scientific 
objective of that study. For applications requiring a com- 
prehensive transcriptome profile, coverage exceeding 10 
million fragments per sample may be needed, with the 
understanding that increasing depth can lead to detection 
of sequences that may not represent bona fide transcripts. 
Alternatively, the number and diversity of growth condi- 
tions included in the analysis can be increased with the 
expectation that, while the number of reads per sample 
will be decreased, numerous transcripts whose abundance 
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is low under one condition will be more highly expressed 
and thus easier to detect under another condition. For 
applications aimed at discovery of a relatively small num- 
ber of previously unannotated genes to be subjected to ex- 
perimental validation and/or functional characterization, 
lower sequencing depth can provide sufficient sensitivity. 
Indeed, a depth of 4 million non-rRNA fragments was 
sufficient for identification of several dozen previously 
unannotated ncRNAs in V. cholerae [3]. Similarly, even 
with only 25,000-30,000 non-rRNA fragments per sample 
we were able to identify 184 annotated genes in EDL933 
whose abundance differed more than 2-fold between late 
exponential and early stationary phases (P < lxlO 5 ). Thus, 
our findings suggest that for many RNA-Seq based studies 
in bacteria, the number of fragments needed to profile 
gene expression in a single rRNA-depleted sample isolated 
from a bacterial monoculture is far less than that pro- 
duced in a single Illumina HiSeq lane. Indeed, our findings 
suggest that at a certain point increased sequencing depth 
may actually be detrimental to the accurate mapping of 
biologically relevant transcripts, yielding reads that likely 
represent contaminants in the cDNA library or the pro- 
ducts of spurious transcriptional events. 

A HiSeq lane typically produces about 150 million 
paired end reads under current run conditions. Thus, 
multiplexing 15-30 samples per lane will yield the 5-10 
million reads per sample that are sufficient for most 
applications of bacterial RNA-Seq. Indeed, our findings 
suggest that for studies of differential gene expression, 
even significantly higher levels of multiplexing result in 
relatively modest decreases in sensitivity. For these types 
of studies, the added biological information provided by 
the inclusion of more strains, growth conditions, and/or 
biological replicates may outweigh this loss of sensitivity 
for detecting transcriptional changes in each pairwise 
comparison of samples. Our findings also suggest that 
for studies in which only a few samples are to be 
sequenced in a single lane, a sufficient number of reads 
may be obtained for samples that are not depleted of 
rRNA and thus the time and cost associated with rRNA- 
depletion may not be justified. Finally, for studies invol- 
ving only one or two samples, such as pilot or proof-of- 
principle experiments, lower throughput platforms such 
as Illumina MiSeq platform may be more appropriate 
than the HiSeq platform. MiSeq yields only about 7.5 
million paired end reads per lane with a only a slightly 
lower reagent cost than a lane of HiSeq but produces 
data in a fraction of the time needed for a HiSeq run, 
making it a good option for those seeking to quickly 
obtain profiles of gene expression in only a few rRNA- 
depleted samples. 

The analysis we conducted was largely limited to data 
derived from single bacterial strains grown in culture. 
However, RNA-Seq is increasingly being used to study 



the transcriptomes of bacteria growing in animal hosts 
and/or as part of complex bacterial communities. Sam- 
ples isolated from animal models are often contaminated 
with a large amount of host RNA. In RNA derived from 
microbial communities, transcripts corresponding to 
particular strains of interest will often be greatly out- 
numbered by those expressed by the numerous other 
members of the community. Thus, in RNA-Seq data 
representing mixed samples, the number of reads corre- 
sponding to transcripts of interest can be orders of mag- 
nitude lower than in data derived from a homogeneous 
bacterial culture. Using RNA-Seq to unravel the dyna- 
mics of bacterial gene expression in these complex and 
biologically relevant samples will therefore require sig- 
nificantly greater sequencing depth per sample, a robust 
depletion of bacterial rRNA, host rRNA, and host 
mRNA, and/or enrichment for transcripts of interest 
through methods such as hybrid capture. 

Conclusion 

We have conducted a systematic analysis of how 
changes in sequencing depth influence the profiling and 
comparison of transcriptomes by RNA-Seq in diverse 
bacterial species and growth conditions. Our findings 
provide a guide for determining the appropriate sequen- 
cing depth for a wide variety of RNA-Seq-based studies 
of bacterial gene expression. 

Methods 

RNA extraction and processing 

RNA was isolated by incubation by TRIzol (Invitrogen) 
followed by passage through Direct-zol columns (Zymo 
Research). Isolation of M. tuberculosis RNA included bead 
beating during incubation with TRIzol [33]. Total RNA was 
depleted of ribosomal RNA using the Ribo-Zero rRNA 
Removal Gram-negative Kit (for E. coli and EDL933) and 
Gram-negative Kit (for M. tuberculosis) (Epicentre) accor- 
ding to the manufacturer s protocol. mRNA-enriched RNA 
isolated using Zymo RNA Clean & Concentrator columns 
(Zymo Research) and treated with DNase using the 
TURBO DNA-free kit (Ambion) according to the manufac- 
turer's protocol. The RNA was then fragmented in a reac- 
tion with 5X Fragmentation Buffer (Affymetrix) heated at 
80°C for 6 minutes and purified using the Zymo RNA 
Clean & Concentrator columns (Zymo Research). 

cDNA synthesis 

Unless otherwise indicated, all reagents in this section were 
obtained from Invitrogen. For first strand cDNA synthesis, 
RNA was incubated with random hexamers at 70°C for 10 
minutes and then chilled on ice. The primer and RNA 
template mix was then added to 5X FS Buffer, 0.1 M DTT, 
10 mM dNTP mix, Actinomycin D (Sigma-Aldrich), 
Superase-in (Ambion), and Superscript III. This reaction 
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was incubated at 25°C for 10 minutes and at 55°C for 1 
hour, then chilled for 5 minutes on ice and cleaned up 
using Zymo RNA Clean & Concentrator (Zymo Research). 
The second strand cDNA synthesis reaction contained the 
product of the first strand synthesis reaction, 5X FS Buffer, 
5X SS Buffer, 0.1 M DTT, 10 mM dUTP mix (Affymetrix/ 
USB), RNase H, DNA Ligase (NEB), and E coli DNA poly- 
merase I (NEB). This reaction was incubated at 16°C for 

2 hrs then placed on ice and terminated with 10 ul of 0.5 
M EDTA. cDNA was then isolated from this reaction using 
the MinElute PCR Purification Kit (Qiagen). 

Illumina library construction and sequencing 

cDNA fragments were end-repaired and phosphorylated, 
followed by adenylation of 3/ends and adapter ligation as 
described [34] with the exception of replacing standard 
paired end adapters with forked adapters containing 
unique 8 base index sequences. Samples were gel size- 
selected for 150-450bp fragment size (4% agarose, 85V, 

3 hours.) Size-selected adaptor ligated cDNA was preincu- 
bated with 1 ml Uracil-Nglycosylase (Applied Biosystems) 
at 37°C for 15 minutes to remove uracils from the second 
cDNA strand. Following incubation at 95°C for 5 minutes, 
each sample underwent 18 cycles of PCR in 4 duplicate 
reactions. Each set of 4 reactions was then combined and 
purified using MinElute columns (Qiagen). Purified lib- 
raries were profiled using the Agilent Bioanalyzer and 
sequenced using the Illumina Hi-Seq platform to yield 
76- 10 lb paired end reads. 

RNA-Seq data analysis 

Reads were aligned to RefSeq reference genomes 
(see Additional file 1: Table SI) using BWA [35] version 
5.9. Gene annotations were obtained from RefSeq and 
Rfam [36]. The overall fragment coverage of genomic 
regions corresponding to features such as ORFs and 
rRNAs was conducted as described [3]. 

In calculating the number of fragments aligning to each 
feature, the paired-end strand-specific RNA-Seq reads were 
assigned to these features based on their overlapping 
genomic coordinates and strand orientation using a custom 
PERL script. Counts of RNA-Seq fragments were computed 
for each feature based on the paired-read mappings. 
Fragments aligning to the DNA strand opposite from the 
transcribed orientation of corresponding annotated features 
were classified and counted as antisense. In the minority of 
cases where only one read of a pair aligned to the genome, 
the entire fragment was assigned to the overlapping feature. 
Where each paired read of individual fragments aligned to 
different features, each feature was assigned a partial frag- 
ment count corresponding to 1/ (number of mapped fea- 
tures). Differentially expressed genes were identified using 
the feature-assigned fragment counts for each replicate as 
input to the DESeq software [32]. 



Genome sequence coverage by RNA-Seq alignments 
was computed using a custom PERL script, where the 
strand-specific nucleotide coverage (C) was incremented 
at each nucleotide position spanned by a read or across 
the range covered by the boundaries of an RNA-Seq 
fragment inferred from a pair of properly mated paired 
end reads. Background subtraction assuming a given 
percent of genomic DNA contamination (pctBkg) was 
performed as follows. The total strand-specific coverage 
was computed by summing strand-specific nucleotide- 
level coverage (Csum) observed across the genome. The 
expected nucleotide-level coverage due to genomic DNA 
contamination (Cbkg) was computed as: 

Cbkg = Csum * (pctBkg/ '1000) 

The effective nucleotide-level background-subtracted 
coverage (Ceff) values were computed as follows: 

Ceff ={ (C - Cbkg <= 0) : 0, 

(C - Cbkg >= 1) : floor (C - Cbkg), 

(0 < C - Cbkg < 1) : 1 

with probability (C — Cbkg) else 0} 
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Additional file 1: Table SI. Refseq accession numbers for strains 
included in this study. 
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Additional file 3: Figure SI. Correlation of coverage of the sense and 
antisense strands of annotated ORFs. 
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